About the Data

This notebook covers the exploration of data used in this project. Prior to putting this together, the following project files were run to generate the data:

  • 1extract_candidate_tweets.py (requires a Twitter API key)
  • 1extract_noncandidate_tweets.py (requires a Twitter API key)
  • 2transform_tweets.py
In [36]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

# tweets data set
fname_tweets = 'data/data.csv'
tweets = pd.read_csv(fname_tweets)
# people data set
fname_people = 'data/People.xlsx' 
people = pd.read_excel(fname_people)
people = people.loc[people['Handle'].isin(tweets['handle'])]

On May 24th, over 300,000 tweets associated with approximately 110 Twitter handles were extracted using the Tweepy library. These handles belong to pre-identified presidential candidates and a sampling of non-candidates, mostly well-known political figures. The handle data was collected manually from publicly available online sources such as https://ballotpedia.org; other sources are cited in the project references.

In [37]:
people.head()
Out[37]:
| | Key | Name | Party | State | Governor | Senate | House | Ran 2020 | 2020 Join | 2020 Drop | Ran 2016 | 2016 Join | 2016 Drop | positions held | Handle | Looked for handle? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Abby Finkenauer-Iowa | Abby Finkenauer | Democratic | Iowa | 0 | 0 | 1 | 0 | NaT | NaT | 0 | NaT | NaT | 1 | RepFinkenauer | 1.0 |
| 11 | Al Green-Texas | Al Green | Democratic | Texas | 0 | 0 | 1 | 0 | NaT | NaT | 0 | NaT | NaT | 1 | RepAlGreen | 1.0 |
| 37 | Amy Klobuchar-Minnesota | Amy Klobuchar | Democratic | Minnesota | 0 | 1 | 0 | 1 | 2019-02-10 | 2020-03-02 | 0 | NaT | NaT | 1 | amyklobuchar | 1.0 |
| 43 | Andrew Yang-New York | Andrew Yang | Democratic | New York | 0 | 0 | 0 | 1 | 2017-11-06 | 2020-02-11 | 0 | NaT | NaT | 0 | AndrewYang | 1.0 |
| 95 | Ben Carson-Maryland | Ben Carson | Republican | Maryland | 0 | 0 | 0 | 0 | NaT | NaT | 1 | 2015-05-03 | 2016-03-04 | 0 | RealBenCarson | 1.0 |

This data was used to determine which individuals announced a candidacy for President of the United States, the date of the announcement, and other attributes. Each observation holds dimensional data for one person; the data includes both candidates and non-candidates. Among these attributes is political affiliation, whose proportions are shown below.
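Party proportions fall out of a one-liner on the `Party` column; a minimal sketch on a hypothetical mini version of the People sheet (the column names are assumptions for illustration):

```python
import pandas as pd

# hypothetical mini version of the People sheet
people_mini = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Party': ['Democratic', 'Democratic', 'Republican', 'Independent'],
})

# proportion of individuals per party
props = people_mini['Party'].value_counts(normalize=True)
print(props['Democratic'])  # 0.5
```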

In [38]:
# are there any individuals that do not come from a government role?
from collections import Counter
roles = ['Governor','Senate','House']
Counter(people[roles].sum(axis=1))
Out[38]:
Counter({1: 73, 0: 14, 2: 20, 3: 2})

There are 14 individuals in the data who did not hold a government role such as House, Senate, or Governor. Some of these held municipal leadership roles, and others came from business backgrounds.

In [39]:
print(people.loc[people[roles].sum(axis=1) == 0, 'Name'].values)
['Andrew Yang' 'Ben Carson' 'Bill de Blasio' 'Carly Fiorina'
 'Donald Trump' 'Julián Castro' 'Lawrence Lessig' 'Marianne Williamson'
 'Michael Bloomberg' 'Pete Buttigieg' 'Richard Ojeda' 'Rocky De La Fuente'
 'Tom Steyer' 'Wayne Messam']
In [40]:
%matplotlib inline
sns.set(rc={'figure.figsize': (10,7)})
sns.countplot(x='Party', data=people, palette=['blue','red','orange'])
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e3aa690fd0>

The majority of individuals in the data set were affiliated with the Democratic Party, followed by the Republican Party.

In addition to their party affiliation, other information includes the states in which they were primarily active.

In [41]:
sns.countplot(y='State', data=people, order = people['State'].value_counts().index)
#plot1 = sns.countplot(x='State', data=people)
#plot1 = plot1.set_xticklabels(plot1.get_xticklabels(), rotation=75, horizontalalignment='right')
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e3b13df198>

The top five states are Massachusetts, New York, California, Texas, and Florida.

The following table shows an aggregation of key attributes by political party.

In [42]:
people.groupby('Party').sum()
Out[42]:
| Party | Governor | Senate | House | Ran 2020 | Ran 2016 | positions held | Looked for handle? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Democratic | 12 | 17 | 29 | 28 | 5 | 58 | 59.0 |
| Independent | 0 | 1 | 1 | 1 | 1 | 2 | 1.0 |
| Republican | 18 | 14 | 27 | 5 | 17 | 59 | 49.0 |

The majority of those who ran in 2020 were Democrats, contrasting with a Republican majority in the 2016 campaign.

The following plot compares party affiliation with primary government role (i.e., Governor, Senate, House).

In [43]:
df_party = people.groupby('Party').sum()
df_party.reset_index(inplace=True)
subset = ['Party', 'Governor', 'Senate', 'House']
df_party = df_party[subset]
df_party = df_party.assign(tmp=df_party[['Governor','Senate','House']].sum(axis=1)).sort_values('tmp', ascending=False).drop(columns='tmp')
#sns.set()
df_party.set_index('Party').T.plot(kind='bar', stacked=False, color = ['red','blue','orange'], figsize=(10,7))
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e3ac6b37f0>

Most of the Democrats come from House or Senate roles, whereas Republicans come from House or Governor roles.
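The assign/sort/drop idiom used in the plotting cell above (add a temporary row-sum column, sort by it, then discard it) can be seen on a toy frame:

```python
import pandas as pd

# toy frame: order rows by their row sum using a temporary column
df = pd.DataFrame({'a': [1, 5], 'b': [2, 1]}, index=['x', 'y'])
ordered = (df.assign(tmp=df.sum(axis=1))
             .sort_values('tmp', ascending=False)
             .drop(columns='tmp'))
print(ordered.index.tolist())  # ['y', 'x']
```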

Exploring Tweets Collected

The tweets were originally collected as JSON, filtered to English-language tweets, and subset to include just the text and creation dates.

In [44]:
from datetime import datetime
import time
import re
import numpy as np

def get_date(created_at):
    """Function to convert Twitter created_at to date format
    Argument:
        created_at {[str]} -- [raw tweet creation date time stamp]
    Returns:
        [str] -- [date e.g. '2020-04-18']
    """
    return time.strftime('%Y-%m-%d', time.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y'))

def get_date_time(tweetdate):
    """Function to convert twitter date to date time object
    
    Arguments:
        tweetdate {[str]} -- [twitter created at date]
    
    Returns:
        [datetime.datetime] -- [date time object]
    """
    
    dt = datetime.strptime(tweetdate, '%Y-%m-%d %H:%M:%S')
    return dt
# example input format: '2020-05-25 01:17:08'

def clean_tweet(tweet):
    """Utility function to clean tweet text by removing links
    , hashtags, @mentions, and numbers
        using simple regex statements. Converts text to lowercase
    
    Arguments:
        tweet {[str]} -- [tweet text]
    
    Returns:
        [str] -- [clean tweet text]
    """    
    # remove links, hashtags, mentions, convert to lowercase
    pattern = re.compile("(\\w+:\\/\\/\\S+)|(#[A-Za-z0-9]+)|(@[A-Za-z]+[A-Za-z0-9-_]+)")
    tweet = re.sub(pattern, ' ', tweet).lower() 
    # remove numbers
    words = tweet.split()
    reformed = [word for word in words if not re.match('\\d', word)]
    tweet = ' '.join(reformed)
    out = ' '.join(tweet.split())
    return out    
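To make the cleaning behavior concrete, here is a self-contained run of clean_tweet on a made-up tweet (the function body is copied from the definition above so the snippet runs on its own):

```python
import re

def clean_tweet(tweet):
    # remove links, hashtags, and @mentions; lowercase; drop number-led tokens
    pattern = re.compile(r"(\w+://\S+)|(#[A-Za-z0-9]+)|(@[A-Za-z]+[A-Za-z0-9-_]+)")
    tweet = re.sub(pattern, ' ', tweet).lower()
    words = [w for w in tweet.split() if not re.match(r'\d', w)]
    return ' '.join(words)

print(clean_tweet("Join us TODAY at 5pm! https://example.com #Vote2020 @SomeHandle"))
# join us today at
```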
In [45]:
print("There are {:d} tweets collected from {:d} Twitter handles".format(tweets.shape[0], len(tweets['handle'].unique())))
There are 308423 tweets collected from 109 Twitter handles

Altogether, the tweets were created over a period of several years. The following plot shows the distribution of creation dates.

In [46]:
dates = tweets['created_at'].apply(get_date_time)
In [47]:
X,Y = np.unique(dates, return_counts=True)
plt.figure(figsize=(10,7))
plt.plot(X, Y)
plt.xticks(rotation=90)
plt.title("Tweet Creation Dates")
#plt.savefig('plt_tweet_time.png', bbox_inches = 'tight')
plt.show()

The tweets in this data set were mostly created between 2018 and 2020.

Plot of Top 10 Candidates by Number of Tweets

In [48]:
from collections import Counter
%matplotlib inline
users = tweets['user_name'].values
counter = Counter(users)
top10users = counter.most_common(10)
labels, values = zip(*top10users)
indexes = np.arange(len(labels))
width = 0.5 
plt.barh(indexes, values, width)
plt.yticks(indexes, labels)
plt.xlabel('Count')
plt.title("Top 10 Persons by Tweet Count")
#plt.savefig('plt_top_users.png', bbox_inches = 'tight')
Out[48]:
Text(0.5, 1.0, 'Top 10 Persons by Tweet Count')
In [49]:
tweet_counts = list(counter.values())
print("Mean tweet count: {:d}".format(int(np.mean(tweet_counts))))
print("Quantiles of tweet counts:", np.quantile(tweet_counts, [0, 0.25, 0.5, 0.75, 1]))
Mean tweet count: 2829
Quantiles of tweet counts: [   8. 3014. 3148. 3180. 3226.]

Because all available tweets were queried from each user's Twitter timeline, most individuals have roughly 3,000 tweets, consistent with the Twitter API's limit of approximately 3,200 most-recent tweets per timeline.

2020 Candidates Tweet Counts

In [50]:
def plot_candidate_tweets(df, title):
    # get party respective to twitter handle; and colors
    partydict = {}
    for idx, row in df.iterrows():
        partydict[row['Handle']] = row['Party']
    handlelist = tweets.loc[tweets['handle'].isin(df['Handle']), 'handle'].tolist()
    counter = Counter(handlelist)
    labels, values = zip(*counter.items())
    colorlist = []
    for label in labels:
        if partydict[label] == 'Republican':
            val = 'red'
        elif partydict[label] == 'Democratic':
            val = 'blue'
        else: 
            val = 'orange'
        colorlist.append(val)
    # number of tweets by user
    indexes = np.arange(len(labels))
    plt.bar(indexes, values, color = colorlist)
    plt.xticks(indexes, labels, rotation=90)
    plt.title(title)

All Candidate Tweet Counts

In [51]:
df = people[(people['Ran 2020'] == 1) | (people['Ran 2016'] == 1)]
plot_candidate_tweets(df, "Candidate Tweet Counts")

All Non-Candidate Tweet Counts

In [52]:
df = people[(people['Ran 2016'] == 0) & (people['Ran 2020'] == 0)]
plot_candidate_tweets(df, "Non-Candidate Tweet Counts")

There are more observations of candidate tweets than non-candidate tweets.

Examine Tweet Text

In the following sections, the tweet text is analyzed by candidates and non-candidates to assess for any noticeable differences. Methods include:

  • corpus statistics
  • term frequency
  • sentiment
  • part-of-speech

The tweet text has the following transformations applied:

  • stopwords removed
  • links, hashtags, and mentions removed; text converted to lowercase
  • alphabetic words only
In [53]:
from nltk import word_tokenize, FreqDist
from nltk.corpus import stopwords
stop_words = list(set(stopwords.words('english')))
stop_words.extend(['rt','amp']) # retweet / ampersand

def corpus_stats(documents):
    """Function to print tweet-level and corpus-level statistics
    These include
    - Average number of characters per tweet
    - Average number of words per tweet
    - Average vocabulary size per tweet
    - Average lexical richness per tweet
    - Total number of words
    - Total vocabulary size
    - Total lexical richness
    - Average number of characters per word
    
    Arguments:
        documents {[list]} -- [list of strings i.e. tweets]
    """
    docs = [word_tokenize(tweet) for tweet in documents]    
    # tweet statistics
    print("observation level statistics:\n")
    # average number of characters per tweet
    avg_chars = int(sum([len(tweet) for tweet in documents]) / len(documents))
    print("Average number of characters per tweet: {:d}".format(avg_chars))
    # average number of words per tweet
    avg_words = int(sum([len(doc) for doc in docs]) / len(docs))
    print("Average number of words per tweet: {:d}".format(avg_words))
    # average vocabulary size per tweet
    avg_vocab = int(sum([len(set(doc)) for doc in docs]) / len(docs))
    print("Average vocabulary size per tweet: {:d}".format(avg_vocab))
    # average lexical richness per tweet (proportion of unique words to total words)
    avg_lex_rich = sum([len(set(doc))/len(doc) for doc in docs if len(doc) > 0]) / len(docs)
    print("Average lexical richness per tweet: {:.2f}".format(avg_lex_rich))
    # corpus statistics
    print("\nCorpus level statistics:\n")
    words = []
    for doc in docs:
        words.extend(doc)
    # total number of words
    print("Total number of words: {:d}".format(len(words)))
    # vocabulary size
    print("Total vocabulary size: {:d}".format(len(set(words))))
    # lexical richness
    print("Total lexical richness: {:.2f}".format(len(set(words)) / len(words)))
    # average number of characters per word
    word_lengths = [len(w) for w in words]
    print("Average number of characters per word: {:.2f}".format(sum(word_lengths) / len(word_lengths))) 

def term_freq(documents):
    """Function to plot top 30 common terms    
    
    Arguments:
        documents {[list]} -- [list of strings i.e. tweets]
    """
    docs = [word_tokenize(tweet) for tweet in documents]        
    words = []
    for doc in docs:
        words.extend(doc)
    filtered_words = [w for w in words if w.isalpha() and w not in stop_words]
    fdist = FreqDist(filtered_words)   
    fdist.plot(30, title = 'Top 30 Most Common Words')

from wordcloud import WordCloud 

def get_wc(documents):
    """Function to plot word cloud
    
    Arguments:
        documents {[list]} -- [list of strings i.e. tweets]
    """ 
    docs = [word_tokenize(tweet) for tweet in documents]        
    words = []
    for doc in docs:
        words.extend(doc)
    filtered_words = [w for w in words if w.isalpha() and w not in stop_words]
    text = ' '.join(word for word in filtered_words)
    wordcloud = WordCloud(background_color='white').generate(text)
    plt.figure(figsize=(10,10))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()
In [54]:
%%time
candidate_tweets = tweets.loc[tweets[['candidate_2020', 'candidate_2016']].sum(axis = 1) > 0, 'text'].apply(clean_tweet).tolist()
noncandidate_tweets = tweets.loc[tweets[['candidate_2020', 'candidate_2016']].sum(axis = 1) == 0, 'text'].apply(clean_tweet).tolist()
Wall time: 10.2 s

Candidate Tweets: Corpus Statistics

In [55]:
%%time
corpus_stats(candidate_tweets)
observation level statistics:

Average number of characters per tweet: 93
Average number of words per tweet: 19
Average vocabulary size per tweet: 17
Average lexical richness per tweet: 0.93

Corpus level statistics:

Total number of words: 3121355
Total vocabulary size: 66101
Total lexical richness: 0.02
Average number of characters per word: 4.02
Wall time: 33 s

Non-Candidate Tweets: Corpus Statistics

In [56]:
%%time
corpus_stats(noncandidate_tweets)
observation level statistics:

Average number of characters per tweet: 94
Average number of words per tweet: 18
Average vocabulary size per tweet: 17
Average lexical richness per tweet: 0.94

Corpus level statistics:

Total number of words: 2768990
Total vocabulary size: 66551
Total lexical richness: 0.02
Average number of characters per word: 4.17
Wall time: 29.2 s

There were no substantial differences between candidate and non-candidate tweets in overall text statistics, at either the observation or the corpus level.

Candidate Tweets: Term Frequency

In [57]:
%%time
term_freq(candidate_tweets)
Wall time: 37.4 s
In [58]:
%%time
get_wc(candidate_tweets)
Wall time: 45.9 s

Non-Candidate Tweets: Term Frequency

In [59]:
%%time
term_freq(noncandidate_tweets)
Wall time: 33.3 s
In [60]:
%%time
get_wc(noncandidate_tweets)
Wall time: 41.1 s

Term frequency distributions show a contrast between candidate and non-candidate tweets: the former declines smoothly, whereas the latter exhibits a steep drop-off in counts after the top three words.

Candidate tweets contain the following high-frequency terms in higher proportions than non-candidate tweets:

  • today | trump | people | president | new | us | need | great

Non-candidate tweets contain the following words in greater proportion:

  • support | vote | proud

Takeaway: very few of the top terms actually distinguish candidates from non-candidates; both groups use similar high-frequency terms. Distinguishing the classes may instead depend on common function words (i.e., some stopwords) and on patterns among less frequent words.

Candidate Tweets: Sentiment

In [61]:
def get_sentiment(score):
    """Utility function to label sentiment from a TextBlob
        polarity score
    
    Arguments:
        score {[float]} -- [TextBlob polarity score]
    
    Returns:
        [str] -- [sentiment label]
    """    
    # sentiment labels
    if score > 0:
        return 'pos'
    elif score == 0:
        return 'neu'
    else:
        return 'neg'

def plot_sentiment(scorelist):
    """Barplot of sentiment labels
    
    Arguments:
        scorelist {[list]} -- [list of Textblob scores]        
    """
    sentlist = [get_sentiment(score) for score in scorelist]
    counter = Counter(sentlist)
    labels = counter.keys()
    values = [v / sum(counter.values()) for v in counter.values()]
    indexes = np.arange(len(labels))
    width = 0.5
    plt.barh(indexes, values, width)
    plt.yticks(indexes, labels)
    plt.xlabel('Proportion')
    plt.title("Sentiment Frequency")
In [62]:
%%time
from textblob import TextBlob
scorelist = [TextBlob(tweet).sentiment.polarity for tweet in candidate_tweets]
fig, ax = plt.subplots()
sns.boxplot(x=scorelist, ax=ax).set_title("Boxplot of Polarity")
Wall time: 1min 19s
In [63]:
plot_sentiment(scorelist)

Non-Candidate Tweets: Sentiment

In [64]:
%%time
scorelist = [TextBlob(tweet).sentiment.polarity for tweet in noncandidate_tweets]
fig, ax = plt.subplots()
sns.boxplot(x=scorelist, ax=ax).set_title("Boxplot of Polarity")
Wall time: 1min 11s
In [65]:
plot_sentiment(scorelist)

The distribution of polarity and the proportion of sentiment labels were similar between candidates and non-candidates. Non-candidates wrote slightly more positive tweets and candidates slightly more negative ones, but overall the difference was minimal.

Candidate Tweets: Part-of-Speech

In [66]:
from nltk import pos_tag

def get_pos_frequency(documents, title):
    """Function to plot POS frequency
    
    Arguments:
        documents {[list]} -- [list of strings i.e. tweets]
    """    
    wordlist = []
    for document in documents:
        wordlist.extend(word_tokenize(document))
    tagged_words = pos_tag(wordlist)
    numNoun,numVerb,numAdj,numAdv = 0,0,0,0
    for (word, tag) in tagged_words:
        if tag.startswith('N'): numNoun += 1     
        if tag.startswith('V'): numVerb += 1
        if tag.startswith('J'): numAdj += 1
        if tag.startswith('R'): numAdv += 1
    total = sum([numNoun, numVerb, numAdj, numAdv])
    height = [numNoun/total, numVerb/total, numAdj/total, numAdv/total]
    bars = ('Nouns','Verbs','Adjectives','Adverbs')
    y_pos = np.arange(len(bars))
    plt.bar(y_pos, height)
    plt.xticks(y_pos, bars)
    plt.title(title)
    plt.show()
In [67]:
%%time
get_pos_frequency(candidate_tweets, "Part-of-Speech Frequency")
Wall time: 2min 28s

Non-Candidate Tweets: Part-of-Speech

In [68]:
%%time
get_pos_frequency(noncandidate_tweets, "Part-of-Speech Frequency")
Wall time: 2min 8s

Overall, differences in part-of-speech between candidate and non-candidate tweets appeared negligible: non-candidate tweets contained slightly more nouns and adjectives, and candidate tweets slightly more verbs and adverbs.

Dendrogram of Users

In aggregate, and without regard to when the tweets were created, does the tweet text distinguish candidates from non-candidates?

In [69]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
import scipy.cluster.hierarchy as shc

""" define dendrogram label colors based on candidate / non-candidate """
candidates = tweets.loc[tweets[['candidate_2020', 'candidate_2016']].sum(axis = 1) > 0, 'handle'].unique().tolist()
noncandidates = tweets.loc[tweets[['candidate_2020', 'candidate_2016']].sum(axis = 1) == 0, 'handle'].unique().tolist()
all_users = candidates + noncandidates
label_colors = {}
for user in all_users:
    if user in candidates:
        label_colors[user] = 'r'
    else:
        label_colors[user] = 'k'

def get_dendro(dat):
    """Function to plot dendrogram
    
    Arguments:
        dat {[dataframe]} -- [dataframe containing 'text' by 'handle']
    """        
    handles = dat['handle'].unique().tolist() # unique users        
    usertweets = {} # dictionary mapping each user to concatenated tweet text
    for user in handles:
        usertweets[user] = ' '.join(clean_tweet(text) for text in tweets.loc[tweets['handle'] == user, 'text'])
    tmp = pd.DataFrame(usertweets.items(), columns = ['user','text']).set_index('user')
    token_pattern = re.compile(r'[A-Za-z]+') # custom token pattern of alphabetic words
    cv = CountVectorizer(stop_words=stop_words, token_pattern=token_pattern)
    dtm = cv.fit_transform(tmp['text'])    
    feat_names = cv.get_feature_names()
    print("Vocabulary size: {:d}".format(len(feat_names)))
    doc_names = tmp.index.values 
    data_scaled = normalize(dtm)
    print("Mean feature values: {:.2f}".format(np.mean(data_scaled)))
    plt.figure(figsize=(10,7))
    plt.title("Clustering Dendrogram")
    dend = shc.dendrogram(shc.linkage(data_scaled.toarray(), method = 'ward'), labels = doc_names)
    ax = plt.gca()
    xlbls = ax.get_xmajorticklabels()
    for lbl in xlbls:
        lbl.set_color(label_colors[lbl.get_text()])
    plt.show()        
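The normalize step scales each user's term-count row to unit Euclidean length (sklearn's default norm is 'l2'), so prolific tweeters don't dominate the clustering distances; a pure-Python sketch of what it does to one row:

```python
# row-wise L2 normalization, as sklearn.preprocessing.normalize does by default
row = [3.0, 4.0]                       # raw term counts for one user
norm = sum(x * x for x in row) ** 0.5  # Euclidean length = 5.0
scaled = [x / norm for x in row]
print(scaled)  # [0.6, 0.8]
```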
In [70]:
%%time
get_dendro(tweets)
Vocabulary size: 60906
Mean feature values: 0.00
Wall time: 23.5 s

From an aggregate perspective, tweets from candidates (red labels) were mostly clustered together. Likewise, tweets from non-candidates (black labels) clustered with other non-candidates for the most part. There were a number of exceptions:

Non-candidates in the candidates cluster:

  • SenSchumer
  • SenMarkey
  • keithellison
  • PramilaJayapal
  • IlhanMN
  • RepThomasMassie
  • TedYoho
  • GovHowardDean
  • RepGosar
  • MarkMeadows
  • RepAlGreen

Candidates in the non-candidates cluster:

  • MauriceGravel
  • GovernorPerry
  • LincolnChafee
  • JohnKasich
  • GovChristie
  • WayneMessam
  • MarkSanford
  • BobbyJindal
  • gov_gilmore
  • GovernorPataki
  • JebBush
  • RealBenCarson
  • CarlyFiorina
  • RickSantorum

Top Terms in Each Cluster Using Mean Term Frequency

In [71]:
# create scaled matrix
handles = tweets['handle'].unique().tolist() # unique users        
usertweets = {} # dictionary mapping each user to concatenated tweet text
for user in handles:
    usertweets[user] = ' '.join(clean_tweet(text) for text in tweets.loc[tweets['handle'] == user, 'text'])
tmp = pd.DataFrame(usertweets.items(), columns = ['user','text']).set_index('user')
token_pattern = re.compile(r'[A-Za-z]+') # custom token pattern of alphabetic words
cv = CountVectorizer(stop_words=stop_words, token_pattern=token_pattern)
dtm = cv.fit_transform(tmp['text'])    
feat_names = cv.get_feature_names()
print("Vocabulary size: {:d}".format(len(feat_names)))
doc_names = tmp.index.values 
data_scaled = normalize(dtm)
Vocabulary size: 60906
In [72]:
# create clustering
from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
cluster.fit_predict(data_scaled.toarray())
Out[72]:
array([0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
      dtype=int64)
In [73]:
labels = np.unique(cluster.labels_)
dfs = []
for label in labels:
    # indices for each cluster
    id_temp = np.where(cluster.labels_ == label)
    # cluster centroid
    x_means = np.mean(data_scaled.toarray()[id_temp], axis = 0)
    # indices of top 10 scores
    sorted_means = np.argsort(x_means)[::-1][:10]
    features = feat_names 
    best_features = [(features[i], x_means[i]) for i in sorted_means]
    best_feature_df = pd.DataFrame(best_features, columns = ['features','score'])
    dfs.append(best_feature_df)
# list of document, label, and cluster
clust_list = list(zip(doc_names, cluster.labels_))
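The top-word ranking above relies on argsort of the cluster centroid; a minimal example of taking the top-3 feature indices by mean score (toy numbers, not the real term scores):

```python
import numpy as np

# mean scores for four hypothetical features
x_means = np.array([0.1, 0.5, 0.2, 0.4])
top3 = np.argsort(x_means)[::-1][:3]  # indices of the largest means, descending
print(top3.tolist())  # [1, 3, 2]
```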
In [74]:
import random
In [75]:
# top words cluster 0
clust = 0
title = "\nCluster: {:d} | {:d} documents | Top 10 Words:".format(clust, sum(cluster.labels_ == clust))
tmp = dfs[clust].sort_values('score')
tmp.plot(
    kind = 'barh', y = 'score', x = 'features', 
    color = 'blue', title = title
)
# who in this cluster?
print("Sample in cluster:\n")
print(random.sample([item for item in clust_list if item[1] == clust],5))
Sample in cluster:

[('GovMikeHuckabee', 0), ('MichaelBennet', 0), ('RepAlGreen', 0), ('JimWebbUSA', 0), ('SenMarkey', 0)]
In [76]:
# top words cluster 1
clust = 1
title = "\nCluster: {:d} | {:d} documents | Top 10 Words:".format(clust, sum(cluster.labels_ == clust))
tmp = dfs[clust].sort_values('score')
tmp.plot(
    kind = 'barh', y = 'score', x = 'features', 
    color = 'blue', title = title
)
# who in this cluster?
print("Sample in cluster:\n")
print(random.sample([item for item in clust_list if item[1] == clust],5))
Sample in cluster:

[('katieporteroc', 1), ('LincolnChafee', 1), ('Sen_JoeManchin', 1), ('JohnKasich', 1), ('EliseStefanik', 1)]

Top Terms that Distinguish Clusters Using F-Statistic on the Mean Difference

In [77]:
import scipy.stats as stats

mydf = pd.DataFrame(data_scaled.toarray(), columns = feat_names, index = doc_names)
scores = []
for column in mydf:    
    data_0 = mydf.loc[cluster.labels_ == 0, column]
    data_1 = mydf.loc[cluster.labels_ == 1, column]
    fvalue, pvalue = stats.f_oneway(data_0, data_1)
    scores.append(fvalue)
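For intuition, the F statistic that f_oneway returns is between-group variance over within-group variance; a pure-Python sketch for the two-group case used here (toy numbers, not the real term scores):

```python
def f_two_groups(a, b):
    # one-way ANOVA F statistic for two groups
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    grand = (sum(a) + sum(b)) / (na + nb)
    ss_between = na * (ma - grand) ** 2 + nb * (mb - grand) ** 2          # df = 1
    ss_within = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)  # df = na + nb - 2
    return (ss_between / 1) / (ss_within / (na + nb - 2))

print(f_two_groups([1, 2, 3], [5, 6, 7]))  # 24.0
```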
In [78]:
# indices of top 10 scores
sorted_scores = np.argsort(scores)[::-1][:10]
features = feat_names 
best_features = [(features[i], scores[i]) for i in sorted_scores]
best_feature_df = pd.DataFrame(best_features, columns = ['features','score']).sort_values('score')
In [79]:
best_feature_df.plot(
    kind = 'barh', y = 'score', x = 'features', 
    color = 'blue', title = "Top 10 Distinctive Features"
)
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e3b55e04a8>